Of the 1000 films in the database, the average film costs about $19.98 to replace and the individual costs range from $9.99 to $29.99. To get an intuition of why films might be more expensive to replace, it can be helpful to explore the table (Figure 1) below where films are sorted by their replacement cost. It is difficult to learn much this way but as will be shown, there may not even be a definitive connection between a film’s characteristics and its replacement cost.
Note: Language ID and Release Year each only have a single unique value across all of the data.
Figure 1
The histogram of replacement costs (Figure 2) reveals the lack of a clear pattern among the replacement costs. To further explore what drives replacement cost, we can look at its relationship to other variables.
Figure 2
Figure 3 is evidence that there is not a linear relationship between the numerical characteristics of a film (rental duration, rental rate, and length), and its replacement cost. The variable with the correlation of the largest magnitude with replacement cost is rental rate at -0.0446. This is a very weak correlation and an additional reason why it would be warranted to explore the relationship between replacement cost and the categorical characteristics of films.
Figure 3
Figures 4 and 5 are box plots depicting the distribution of replacement cost relative to the unique values of the two categorical characteristics of films in our data: film rating and special features. Curiously, both the minimum and maximum replacement cost associated with each value for both of these characteristics are almost identical. Combining this with the similarities in interquartile range of replacement cost across the characteristics, it suggests that the categorical valuables are also not clearly connected to the replacement cost.
Note: For Figure 4, hover your cursor over individual box plots to better see the special feature groupings that they correspond to.
Figure 4
Figure 5
Conclusion
After this preliminary analysis, there are no easily identifiable connections between a film’s characteristics and its replacement cost. For future analysis it might be helpful to model non-linear relationships and interactions between variables as they relate to replacement cost. This could be done by fitting a tree-based model and interpreting the feature importances or shapley values.
The rental data covers the time period between May 24th, 2005 to February 14th 2006. There is also an alarming 5-month gap in the data between August 2005 and February 2006. This can be seen in Figures 6 and 8. Such a gap, with 0 rentals, suggests a serious, long-term issue with the data pipeline.
Ignoring this error and the small amount of data (16044 rentals across nearly 10 months), we can still attempt some basic analysis. Looking at Figure 6, it appears that throughout the entire timescale of the data, PG-13 movies are rented most often and G movies are rented least often. Films with other ratings have consistently similar rental counts. This relationship lines up very nicely with the the inventory amounts of films with each rating shown in Figure 7.
The connection between rental counts and inventory continues for categories. This can be seen in Figures 8 and 9. It may be helpful to reference the note here for Figure 9, you can start with the most frequent category in the inventory (Sports) and continually add the next most frequent category.
Note: For Figures 6 and 9, it might be helpful to double click the lines in the legend to view one rating or category at a time. You can toggle on other ratings/categories by clicking once on their line. Be aware that isolating a single rating or category can change the scale of the y axis.
Figure 6
Figure 7
Figure 8
Figure 9
Conclusion
For this question, timescale and data quality are the main factors preventing any more complex analysis. The relationship between inventory amounts and rental counts is fairly consistent for this small time window, even as total rental count changes. If there was more data, it would be interesting to see if this relationship continues. It would also facilitate in identifying any trends or seasonalities, while also allowing for more robust forecasting of rentals.
It is important to point out that although the data shows that films are rented in 597 different cities, there are only 2 unique stores. The store with ID 1 is listed in the Alberta district of Lethbridge. The store with ID 2 is listed in the QLD district of Woodbridge.
In real data the discrepancy between store counts and cities of rentals would raise major questions about the data quality, but this is fake data after all. We can still try to do some analysis on how the 2 stores compare to each other.
Figure 10 shows that, over the time period covered by the data, both stores have a nearly identical pattern of rentals per month, with store 1 consistently outperforming store 2. The previous finding where we saw that more inventory of films leads to more rentals also seems to hold true for the stores. Figure 11 shows that store 1 simply has more inventory than store 2 and this could very well explain the performance difference between the two stores.
Figure 10
Figure 11
I was curious how the two stores compare in terms of rentals over the course of an average day so I created Figure 12. Figure 12 depicts the average number of rentals in each store across every hour of the day. It seems that the stores do have very similar daily rental patterns, with both stores on average seeing a spike in rentals at 3:00 PM. It is also interesting to note that these stores are apparently open 24 hours a day.
Figure 12